Skip to content

Bridge instantiate_utils: drop unexpected config keys with warning#1203

Merged
yaoyu-33 merged 10 commits into
mainfrom
yuya/update-init-utils-allow-unexpected
Nov 26, 2025
Merged

Bridge instantiate_utils: drop unexpected config keys with warning#1203
yaoyu-33 merged 10 commits into
mainfrom
yuya/update-init-utils-allow-unexpected

Conversation

@yaoyu-33

@yaoyu-33 yaoyu-33 commented Nov 6, 2025

Copy link
Copy Markdown
Contributor

Risks of Using Lenient Mode for Config Handling**

Using lenient mode when loading configs (instead of strict) can avoid errors when config fields are missing or changed, but it introduces several concerns:

Pros

  • Prevents errors when config keys are missing or removed across versions.
  • Makes backward compatibility easier initially.

⚠️ Risks / Potential Issues

  1. Hidden Config Mismatches

    • Missing or outdated config fields won’t throw errors, so silent differences may go unnoticed.
    • This can lead to unexpected behaviors and makes debugging harder.
  2. Reproducibility Concerns

    • With lenient mode, there are no guarantees that runs are reproducible across PyTorch/TE/MCore versions.
    • Config drift over time becomes hard to track.
  3. Version Guarding Becomes MCore’s Problem

    • If configs are dropped or renamed, a lenient approach pushes the burden to MCore side to guard changes properly.
  4. Technical Debt Over Time

    • Without strict checking, config schema changes accumulate “silent failures”.
    • Eventually, migrations or retroactive fixes become harder.
  • Introduces defensive handling in megatron.bridge.utils.instantiate_utils.instantiate_node to gracefully ignore unexpected config keys when instantiating targets.

  • Emits a logging.WARNING identifying dropped keys and the config full_key to aid debugging.

  • Adds unit tests verifying both the drop-and-warn behavior and that targets accepting **kwargs are unaffected.

  • Why

    • Backward compatibility with older checkpoints/config YAMLs that may contain stale fields (e.g., use_megatron_fsdp) caused failures like:
      • TypeError("OptimizerConfig.init() got an unexpected keyword argument 'use_megatron_fsdp'")
    • We prefer a non-fatal path with clear warnings rather than hard failures during resume/migration.
  • What changed

    • instantiate_utils.py
      • New _filter_kwargs_for_target(target, kwargs, full_key):
        • Uses inspect.signature to determine accepted keyword parameters.
        • If the target has **kwargs, forwards all keys unchanged.
        • Otherwise, drops keys not present in keyword-capable params and logs a warning including the dropped key list and full_key.
      • Integrated the filter just before calling the resolved _target_.
    • Tests
      • tests/unit_tests/bridge/test_instantiate_utils.py:
        • test_drops_unexpected_kwargs_and_warns: ensures unexpected keys are removed and a warning mentioning them is emitted.
        • test_allows_kwargs_when_target_accepts_var_kwargs: ensures no warning is produced and keys are preserved when the target accepts **kwargs.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Nov 6, 2025

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…None in instantiate_utils.py and now let exceptions propagate, aligning with the new filtering behavior.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Comment on lines +383 to +389
if mode == InstantiationMode.LENIENT:
# Warn and drop the unexpected keys
warning_msg = f"Dropping unexpected config keys for target '{target_str}': {sorted(unexpected)}"
if full_key:
warning_msg += f"\nfull_key: {full_key}"
logging.warning(warning_msg)
return {k: v for k, v in kwargs.items() if k in allowed_keys}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this avoids erroring out when fields are removed from the config. but in the case of changes like this, the semantics are lost: NVIDIA/Megatron-LM#1917

for example:

  • a user previously used set external_cuda_graph=True and the config did not yet have cuda_graph_scope and has checkpoints saved with this config
  • in a newer version where external_cuda_graph is removed, this PR will drop the arg in lenient mode.

it might not always be possible to infer the new setting from an old one so this might be inevitable, but we should make no claims about full reproducibility across versions in this case

@@ -0,0 +1,85 @@
#!/usr/bin/env python3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file can be removed now

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 70be68d

ananthsub
ananthsub previously approved these changes Nov 12, 2025
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test d96f333

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 6d92b19

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test dc5e056

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants